Exploratory Data Analysis on US Real Estate¶

estate

INTRODUCTION¶

The United States is the largest economy in the world. Based on realtrends article, real estate accounts for 16.9% of US's GDP on 2021 which is considered a major GDP contributor. The following analysis aims to provide viewers with real estate insights that may facilitate general understanding and potential investment decisions through market trends and factors influencing the real estate market.

In [1]:
# Importing Libraries 
import pandas as pd 
import numpy as np
import seaborn as sns 
import statsmodels.api as sm
import matplotlib.pyplot as plt
from IPython.display import HTML

HTML('''
    <script>
        code_show=true; 
        function code_toggle() {
         if (code_show){
         $('div.input').hide();
         } else {
         $('div.input').show();
         }
         code_show = !code_show
        } 
        $( document ).ready(code_toggle);
    </script>
    <form action="javascript:code_toggle()">
        <input type="submit" value="Toggle Code">
    </form>
''')
Out[1]:
In [2]:
# Importing data 
raw_real_estate = pd.read_csv("realtor_data.csv")

DATA CLEANING AND VALIDATION¶

This stage contains initial exploration of the real estate dataset. It involves identifying missing values, reformating certain columns for analysis purpose, filtering out duplicated datas and keep thedata required for analysis purpose.

In [3]:
# Initial Exploration
# Identifying the shape of the dataset 
shape = raw_real_estate.shape
print(shape)
# identifying the data types of all the columns
data_types = raw_real_estate.dtypes
(923159, 12)
In [4]:
# Identifying the number of missing values in each the columns 
missing_values = raw_real_estate.isna().sum().reset_index()
missing_values.columns = ['Column_Names', 'Missing_value_Count']
display(missing_values)
Column_Names Missing_value_Count
0 status 0
1 price 71
2 bed 131703
3 bath 115192
4 acre_lot 273623
5 full_address 0
6 street 2138
7 city 74
8 state 0
9 zip_code 205
10 house_size 297843
11 sold_date 466763
In [5]:
# Changing the date column into "date" data type 
raw_real_estate['sold_date'] = pd.to_datetime(raw_real_estate['sold_date'], format = '%Y-%m-%d')
In [6]:
# Removing rows with missing values in certain columns
raw_real_estate = raw_real_estate.dropna(subset = ['price', 'bed', 'bath', 'acre_lot', 'zip_code', 'house_size', 'sold_date'])
In [7]:
## Filtering for datas for years between year 2000 and 2022 (ordered)

# Finding the max and min dates in the dataset 
raw_real_estate['sold_date'].agg(['min', 'max'])

# Filtering the date from the range of dates in order
raw_real_estate = raw_real_estate[(raw_real_estate['sold_date'] >= '2000-01-01')
                     & (raw_real_estate['sold_date'] <= '2022-12-31')].sort_values("sold_date")

# Filtering the columns needed for further analysis
final_real_estate = raw_real_estate.loc[:, (raw_real_estate.columns != 'street') ]

# Removing Duplicate Values 
final_real_estate = final_real_estate.drop_duplicates(subset = ["full_address", "sold_date"])

# Reformating the price column 
final_real_estate['price'] = final_real_estate['price']/1000

#  Create the column called "year"
final_real_estate['year'] = pd.DatetimeIndex(final_real_estate['sold_date']).year

# Renaming the price column
final_real_estate.rename(columns = {"price": "Sale_Price_Thousands"}, inplace = True) 
In [8]:
# Adding a new column called Price per Squre Foot (Value for Money)
final_real_estate['price_per_sqft'] = (final_real_estate['Sale_Price_Thousands']/final_real_estate['house_size']) * 1000
In [9]:
# Adding a new column called room ratio (feeling of balance of the house)
final_real_estate['room_ratio'] = final_real_estate['bed']/final_real_estate['bath']
In [31]:
# Downloading the new code
final_real_estate.to_csv("final_real_estate.csv", index = False)
In [10]:
display(final_real_estate.head(10))
status Sale_Price_Thousands bed bath acre_lot full_address city state zip_code house_size sold_date year price_per_sqft room_ratio
920877 for_sale 225.0 3.0 2.0 1.18 21 Morris Dr, Newburgh, NY, 12550 Newburgh New York 12550.0 1245.0 2000-01-03 2000 180.722892 1.500000
634864 for_sale 451.0 4.0 3.0 0.29 16 Underwood Ct, Burlington, NJ, 08016 Burlington New Jersey 8016.0 2085.0 2000-01-04 2000 216.306954 1.333333
627427 for_sale 289.0 1.0 1.0 0.13 113 Cleveland Ln, Rockaway, NJ, 07866 Rockaway New Jersey 7866.0 915.0 2000-01-04 2000 315.846995 1.000000
229776 for_sale 350.0 3.0 2.0 0.15 31 Benefit St, Pawtucket, RI, 02861 Pawtucket Rhode Island 2861.0 1500.0 2000-01-04 2000 233.333333 1.500000
448440 for_sale 365.0 2.0 1.0 0.95 49 Tinker Hill Rd, Washington, CT, 06777 Washington Connecticut 6777.0 720.0 2000-01-04 2000 506.944444 2.000000
590524 for_sale 949.9 5.0 5.0 5.04 5 Country Ln, Tewksbury Township, NJ, 07830 Tewksbury Township New Jersey 7830.0 4508.0 2000-01-04 2000 210.714286 1.000000
778538 for_sale 695.0 6.0 2.0 0.07 1324 Star Ave, Elmont, NY, 11003 Elmont New York 11003.0 1770.0 2000-01-04 2000 392.655367 3.000000
612420 for_sale 85.0 3.0 1.0 0.02 1126 Chestnut St, Wilmington, DE, 19805 Wilmington Delaware 19805.0 1075.0 2000-01-04 2000 79.069767 3.000000
449429 for_sale 349.0 3.0 2.0 0.18 11 Dartmouth Ln, Danbury, CT, 06810 Danbury Connecticut 6810.0 1170.0 2000-01-05 2000 298.290598 1.500000
686596 for_sale 154.9 4.0 1.0 0.22 216 South Ave, Bridgeton, NJ, 08302 Bridgeton New Jersey 8302.0 1008.0 2000-01-05 2000 153.670635 4.000000

KEY ANALYSIS STAGE¶

Overall Market Trend and Analysis¶

Identifying total real estate sales volumes from year 2000 to 2022¶

A usual business cycle can be identified in the time series plot below where the peaks occured at 2005 and 2017 and the downturn at 2000, 2011 and 2022. The downturn occured at 2011 maybe considered weak recoveries from the 2008 global financial crisis and 2022 downturn maybe caused by the covid pandemic and the rise of interest rates.

In [11]:
# Finding the sum of real estate sales Volume in each of the year 
General_Sales_Volume = final_real_estate['year'].value_counts().sort_index()

# Plotting the Real Estate Sales Volume over the years 
Sales_Volume_Plot = General_Sales_Volume.plot(title = "Total Real Estate Sales Volume from 2000 to 2022", 
                                              xlabel = "year", ylabel = "Sales_Volume", kind = "line")

Identifying total real estate sales value from year 2000 to 2022¶

The real estate sales value time series plot below shows a similar pattern as the sales volume time series plot. This may be due to the large value each real estate carries which makes changes in real estate sales very obvious.

In [12]:
# Finding the sum of real estate sales Value in each of the year 
General_Sales_Values = pd.DataFrame(final_real_estate.groupby('year')['Sale_Price_Thousands'].sum())
Sales_Revenue_Plot = General_Sales_Values.plot(title = "Total Real Estate Sales Values from 2000 to 2022", 
                                               ylabel = "Sales_Price_Thousands", kind = "line", legend = False)

Which State have the highest median property price?¶

Currently Virgin islands is a state with the highest median property price however, there were only datas on 2 properties listed on Virgin Island and one of the property price was heavily skewed thus creating an outlier. Therefore, its not a good indicator of a median prices in Virgin Islands and thus will be removed from the chart for more accurate analysis.

In [13]:
# find the median house prices in each state
median_prices = final_real_estate.groupby("state")['Sale_Price_Thousands'].median().reset_index().sort_values(by='Sale_Price_Thousands', ascending=False)
median_prices_except_highest = median_prices.iloc[1:]

plt.figure(figsize=(12, 6))
sns.barplot(data=median_prices_except_highest, x='state', y='Sale_Price_Thousands', palette='viridis')
plt.xticks(rotation=90)
plt.xlabel("State")
plt.ylabel("Median Property Price (Thousands $)")
plt.title("Median Property Price by State")
plt.show()
In [14]:
# identifying all the properties from Virgin Islands
virgin_islands_rows = final_real_estate[final_real_estate["state"] == "Virgin Islands"]
display(virgin_islands_rows)
status Sale_Price_Thousands bed bath acre_lot full_address city state zip_code house_size sold_date year price_per_sqft room_ratio
11371 for_sale 950.0 5.0 4.0 0.99 46 Water Isle, Saint Thomas, VI, 00802 Saint Thomas Virgin Islands 802.0 5000.0 2013-10-11 2013 190.000000 1.250000
10351 for_sale 6899.0 4.0 6.0 0.83 10 Water Isle, Saint Thomas, VI, 00802 Saint Thomas Virgin Islands 802.0 4600.0 2018-04-05 2018 1499.782609 0.666667

Identifying the Sales Volume of the top 5 states¶

New Jersey has the highest Real Estate sales volume across 22 years while New York is slightly ahead of Pennsylvania.

In [15]:
#2. Identifying the sales volume based on states 

# Identifying the number of states available in the US
final_real_estate['state'].nunique()

# Identifying the top 5 states in terms of real estate sales volume
final_real_estate['state'].value_counts().sort_values().tail(5).plot(kind = 'barh', 
                                                                     color = ['Green', 'Blue', 'Orange', 'Brown', 'Purple'])
plt.ylabel("States")
plt.xlabel("Real Estate Sales Volume")
plt.title("Top 5 US States Real Estate Sales Volume")
Out[15]:
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Volume')

Identifying the sales volume of the top 5 US states over the years¶

New Jersey has the highest Real Estate sales volume each year for 22 years while the New York state has fallen behind Pennsylvania in recent years. Furthermore, Massachusetts have the least variation and lowest real estate sales for the past 12 years. Overall the real estate sales volume in the top 5 states follows a similar trend over the past 22 years.

In [16]:
# Identifying the real estate sales volume in the top 5 states over the years 

# Filter out datas that only contains the top 5 states in terms of volume 
top_5_states = final_real_estate[final_real_estate.state.isin(["New Jersey", "Connecticut", "New York", "Pennsylvania", "Massachusetts"])]
top_5_states_sales_volume = pd.DataFrame(top_5_states.groupby(['year', 'state']).size()).rename(columns = {0: "State_Count"})

# Plotting the 
Palette = ["Brown", "Green", "Purple", "Orange", "Blue"]
sns.set_palette(Palette)
top_5_states_sales_volume = sns.relplot(x = "year", y = "State_Count", 
                                        data = top_5_states_sales_volume, kind = "line", hue = "state")
top_5_states_sales_volume.fig.suptitle("Top 5 States Real Estate Sales Volume from year 2000 to 2022", y = 1)
Out[16]:
Text(0.5, 1, 'Top 5 States Real Estate Sales Volume from year 2000 to 2022')

Identifying the sales value of the top 5 US states¶

The top 5 US states for sales volume and sales value are the same, however, the New York state has taken the first position having the most real estate sales value for the past 22 years.

In [17]:
#2b Identifying the top 5 states based Sale_Price_Thousandson real estate sales value
final_real_estate.groupby('state')['Sale_Price_Thousands'].sum().sort_values().tail(5).plot(kind = "barh", 
                                                color = ['Blue', 'Green', 'Brown', 'Purple', "Orange"])
plt.xlabel("Real Estate Sales Value")
plt.ylabel("States")
plt.title("Top 5 US States Real Estate Sales Value")
Out[17]:
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Value')

Median Prices Per Sqft Across the States¶

From the Prices per Sqft analysis, it shows only New York and Massachusetts have median price per sqft that is above 300 dollars while other states have quite similar median price per sqft. This would provide some insights of value for money and the prices of properties in each states, however it is important to note that this metric does not consider qualitative factors such as property layout and design and the benefits specific locations bring.

In [18]:
filtered_data = final_real_estate[~final_real_estate['state'].isin(['Puerto Rico', 'Virgin Islands'])]
state_pricespersqft = filtered_data.groupby('state')['price_per_sqft'].median().sort_values()

# generate unique colours for each state using seaborn
colors = sns.color_palette('tab20', len(state_pricespersqft))

state_pricespersqft.plot(kind = 'barh', color = colors)


plt.xlabel("Prices per Sqft")
plt.ylabel("States")
plt.title("Prices per Sqft Across the US States")
Out[18]:
Text(0.5, 1.0, 'Prices per Sqft Across the US States')
In [19]:
state_counts = final_real_estate['state'].value_counts()
display(state_counts)
New Jersey        7348
Connecticut       5113
New York          4240
Pennsylvania      3999
Massachusetts     1920
Rhode Island      1152
New Hampshire      682
Delaware           602
Vermont            432
Maine              363
Virgin Islands       2
Puerto Rico          1
Name: state, dtype: int64

Identify the most expensive real estate sold in each year¶

In [20]:
#3 Identify the Information of the Most expensive real estate in each year.
Most_Expensive_Real_Estate = top_5_states.loc[top_5_states.groupby('year')['Sale_Price_Thousands'].idxmax()]
display(Most_Expensive_Real_Estate)
status Sale_Price_Thousands bed bath acre_lot full_address city state zip_code house_size sold_date year price_per_sqft room_ratio
922453 for_sale 39000.0 8.0 13.0 8.23 555 Lake Ave, Greenwich, CT, 06830 Greenwich Connecticut 6830.0 18954.0 2000-07-17 2000 2057.613169 0.615385
649077 for_sale 13950.0 10.0 14.0 16.93 770 Godfrey Rd, Villanova, PA, 19085 Villanova Pennsylvania 19085.0 19000.0 2001-03-07 2001 734.210526 0.714286
560341 for_sale 32500.0 8.0 8.0 0.05 20 E 73rd St, New York, NY, 10021 New York New York 10021.0 11371.0 2002-03-21 2002 2858.147920 1.000000
409975 for_sale 60000.0 9.0 10.0 400.00 450 Brickyard Rd, Woodstock, CT, 06281 Woodstock Connecticut 6281.0 18777.0 2003-08-28 2003 3195.398626 0.900000
495469 for_sale 25000.0 6.0 11.0 0.07 70 Broad St, New York City, NY, 10004 New York City New York 10004.0 19478.0 2004-07-12 2004 1283.499333 0.545455
204657 for_sale 15000.0 6.0 6.0 4.50 2 Squibnocket Rd, Chilmark, MA, 02535 Chilmark Massachusetts 2535.0 4706.0 2005-10-04 2005 3187.420314 1.000000
734864 for_sale 18500.0 3.0 5.0 1.94 151 E 58th St Ph 55W, New York City, NY, 10022 New York City New York 10022.0 4500.0 2006-07-11 2006 4111.111111 0.600000
743986 for_sale 20500.0 4.0 5.0 1.33 New York City, NY, 10023 New York City New York 10023.0 3333.0 2007-12-20 2007 6150.615062 0.800000
743290 for_sale 50000.0 6.0 8.0 0.22 995 5th Ave Unit Ph, New York City, NY, 10028 New York City New York 10028.0 6891.0 2008-09-12 2008 7255.840952 0.750000
64850 for_sale 15000.0 8.0 10.0 157.00 47 White Bridge Rd, Chatham, NY, 12136 Chatham New York 12136.0 14058.0 2009-10-20 2009 1067.008109 0.800000
917730 for_sale 10600.0 8.0 11.0 4.02 488 West Rd, New Canaan, CT, 06840 New Canaan Connecticut 6840.0 11923.0 2010-07-12 2010 889.037994 0.727273
743968 for_sale 17950.0 7.0 8.0 0.23 535 W End Ave Unit 12TH, New York City, NY, 10024 New York City New York 10024.0 8451.0 2011-01-05 2011 2124.008993 0.875000
917641 for_sale 16000.0 7.0 9.0 47.16 187 Umpawaug Rd, Redding, CT, 06896 Redding Connecticut 6896.0 7590.0 2012-11-28 2012 2108.036891 0.777778
743414 for_sale 45000.0 4.0 10.0 0.05 12 E 82nd St, New York City, NY, 10028 New York City New York 10028.0 12000.0 2013-02-05 2013 3750.000000 0.400000
463153 for_sale 32000.0 6.0 12.0 10.00 16 Hurlingham Dr, Greenwich, CT, 06831 Greenwich Connecticut 6831.0 23700.0 2014-02-21 2014 1350.210970 0.500000
898460 for_sale 19000.0 4.0 5.0 0.05 18 W 11th St, New York City, NY, 10011 New York City New York 10011.0 6000.0 2015-06-09 2015 3166.666667 0.800000
781352 for_sale 18250.0 5.0 6.0 0.05 116 E 65th St, New York City, NY, 10065 New York City New York 10065.0 6950.0 2016-04-28 2016 2625.899281 0.833333
742416 for_sale 16000.0 7.0 12.0 0.05 224 W 22nd St, New York City, NY, 10011 New York City New York 10011.0 7640.0 2017-01-20 2017 2094.240838 0.583333
751525 for_sale 12995.0 4.0 4.0 0.02 40 Leroy St, New York City, NY, 10014 New York City New York 10014.0 3650.0 2018-07-27 2018 3560.273973 1.000000
640427 for_sale 29900.0 5.0 3.0 209.50 944 Providence Rd, Newtown Square, PA, 19073 Newtown Square Pennsylvania 19073.0 2900.0 2019-02-07 2019 10310.344828 1.666667
461367 for_sale 13650.0 7.0 10.0 2.53 32 Grahampton Ln, Greenwich, CT, 06830 Greenwich Connecticut 6830.0 10534.0 2020-06-26 2020 1295.804063 0.700000
656439 for_sale 12500.0 5.0 6.0 0.66 5499 Dune Dr, Avalon, NJ, 08202 Avalon New Jersey 8202.0 4462.0 2021-01-11 2021 2801.434334 0.833333
678035 for_sale 11500.0 9.0 13.0 0.85 4816 5th Ave, Avalon, NJ, 08202 Avalon New Jersey 8202.0 10000.0 2022-12-31 2022 1150.000000 0.692308

At which states are the most expensive real estate sold located in?¶

Out of 22 years, the New York state sold the most expensive real estate in the US for 12 years, outbeating other states by at least 50%. Moreover, this may mean that real estates in New York state tends to appreciate more and/or has higher value due to higher living standards.

In [21]:
#3b At which states are the most expensive real estates sold located in? 
plot = Most_Expensive_Real_Estate['state'].value_counts().plot(kind = "bar", 
                                                                color = ['Orange', 'Brown', 'Blue', 'Purple', "Green"])

plt.xticks(rotation = 0)
plt.xlabel("State")
plt.ylabel("Most Expensive Real Estate Count")
plt.title("Most Expensive Real Estate Count in the Top 5 States")
Out[21]:
Text(0.5, 1.0, 'Most Expensive Real Estate Count in the Top 5 States')

Relationship Analysis¶

Identifying the relationships between number of bathrooms, bedrooms, acre_lot and house size to the sale price.¶

From the correlation matrix shown below, turns out number of bathrooms has the strongest positive linear relationship to the sale price compared to other variables. It may be unexpected that the size of the real estate is not the variable with the strongest linear relationship. However, its important to point out that there are other factors that influences real estate prices such as the environment, location choice and the ability of the property to demand higher rents which is influenced by the purpose of using the land.

In [22]:
# Selecting columns for correlation calculation
columns = ['Sale_Price_Thousands', 'bath', 'bed', 'acre_lot', 'house_size']
subset = top_5_states[columns]

display(subset.corr())
Sale_Price_Thousands bath bed acre_lot house_size
Sale_Price_Thousands 1.000000 0.554682 0.296698 0.008957 0.431435
bath 0.554682 1.000000 0.618261 0.007481 0.600317
bed 0.296698 0.618261 1.000000 -0.000537 0.465522
acre_lot 0.008957 0.007481 -0.000537 1.000000 0.004948
house_size 0.431435 0.600317 0.465522 0.004948 1.000000

House Size Vs House Price¶

The the correlation between House Size and House Price is 0.4314. Which shows that House size does not have a strong positive relationship with the housing prices

In [23]:
correlation = final_real_estate[['house_size', 'Sale_Price_Thousands']].corr()

# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='house_size', y='Sale_Price_Thousands', alpha=0.4)
plt.title("House Size vs. Price")
plt.xlabel("House Size (sqft)")
plt.ylabel("Price ($)")
plt.show()

Number of Bathrooms vs Price¶

The number of bathrom has the strongest positive relationship with the housing price based on the correlation matrix (0.55), however it is important to note that there are other factors that would contribute to housing prices such as the population density of the state and other aspects of the property itself.

In [24]:
correlation = final_real_estate[['bath', 'Sale_Price_Thousands']].corr()

# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='bath', y='Sale_Price_Thousands', alpha=0.4)
plt.title("Number of Bathrooms vs. Price")
plt.xlabel("Number of Bathrooms")
plt.ylabel("Price ($)")
plt.show()

What type of house is the most popular over the years? (Based on the number of bathrooms and bedrooms)¶

Based on the table below, a 3 bedroom and 2 bathroom property is the most popular real estate in terms of sales volume every year for the past 22 years. This maybe due to the fact that this is the average family size in the US which drives the most demand for this type of property structure. Moreover, the spike of property sales during 2021 may be because of the covid 19 pandemic which may have affected the financial wellbeing of alot of people, which led to the sale of property. Moreover, the drop in property sales on 2022 may be due to insufficient data collected for the year.

In [35]:
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']

# Generating the index to identify the max counts for each year (idx is in boolean)
idx = raw_top_house.groupby(['year'])['counts'].transform(max) == raw_top_house['counts']
# Will only return rows that matches True
popular_property = raw_top_house[idx]


import plotly.express as px

# Create the line chart
fig = px.line(
    popular_property,
    x='year',
    y='counts',
    markers=True,
    title= "Sales Trend for the Most demanded property over the years (3 bedroom and 2 bathrooms)",
    labels={'year': 'Year', 'counts': 'Number of Listings'}
)


# Customize hover tooltip (optional)
fig.update_traces(hovertemplate='Year: %{x}<br>Listings: %{y}')


# Show the interactive chart
fig.show(renderer = "notebook")

What are the least popular property in each of the year?¶

From the results, property with unbalanced bathroom to bedroom ratio property or properties that are much larger in size (bedroom and bathroom numbers) tend to be less popular due to the average household size which makes such properties harder to sell.

In [26]:
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']

# Generating the index to identify the max counts for each year (idx is in boolean)
idx_min = raw_top_house.groupby(['year'])['counts'].transform(min) == raw_top_house['counts']
# Will only return rows that matches True

pd.set_option('display.max_rows', 500) 
display(raw_top_house[idx_min])
year bed bath counts
34 2000 11.0 12.0 1
35 2000 11.0 6.0 1
36 2000 11.0 11.0 1
37 2000 9.0 9.0 1
38 2000 7.0 6.0 1
39 2000 1.0 3.0 1
40 2000 3.0 5.0 1
41 2000 3.0 6.0 1
42 2000 4.0 6.0 1
43 2000 6.0 12.0 1
44 2000 7.0 3.0 1
45 2000 7.0 8.0 1
46 2000 9.0 7.0 1
47 2000 7.0 10.0 1
48 2000 8.0 2.0 1
49 2000 8.0 5.0 1
50 2000 8.0 6.0 1
51 2000 8.0 13.0 1
52 2000 9.0 3.0 1
88 2001 6.0 9.0 1
89 2001 1.0 3.0 1
90 2001 3.0 5.0 1
91 2001 4.0 8.0 1
92 2001 5.0 8.0 1
93 2001 5.0 10.0 1
94 2001 7.0 8.0 1
95 2001 7.0 9.0 1
96 2001 8.0 2.0 1
97 2001 8.0 8.0 1
98 2001 9.0 9.0 1
99 2001 10.0 12.0 1
100 2001 10.0 14.0 1
101 2001 12.0 9.0 1
102 2001 12.0 6.0 1
137 2002 2.0 4.0 1
138 2002 9.0 5.0 1
139 2002 10.0 5.0 1
140 2002 9.0 4.0 1
141 2002 3.0 8.0 1
142 2002 6.0 10.0 1
143 2002 4.0 7.0 1
144 2002 4.0 8.0 1
145 2002 5.0 1.0 1
146 2002 5.0 9.0 1
147 2002 6.0 8.0 1
148 2002 6.0 9.0 1
149 2002 6.0 11.0 1
150 2002 8.0 8.0 1
151 2002 6.0 12.0 1
152 2002 7.0 10.0 1
153 2002 7.0 11.0 1
154 2002 8.0 3.0 1
155 2002 8.0 5.0 1
156 2002 8.0 6.0 1
193 2003 7.0 11.0 1
194 2003 1.0 4.0 1
195 2003 3.0 5.0 1
196 2003 4.0 7.0 1
197 2003 5.0 9.0 1
198 2003 6.0 8.0 1
199 2003 7.0 7.0 1
200 2003 7.0 13.0 1
201 2003 8.0 1.0 1
202 2003 8.0 7.0 1
203 2003 9.0 6.0 1
204 2003 9.0 8.0 1
205 2003 9.0 10.0 1
252 2004 9.0 10.0 1
253 2004 10.0 3.0 1
254 2004 13.0 14.0 1
255 2004 15.0 9.0 1
256 2004 16.0 9.0 1
257 2004 8.0 7.0 1
258 2004 8.0 13.0 1
259 2004 7.0 12.0 1
260 2004 7.0 8.0 1
261 2004 4.0 7.0 1
307 2005 12.0 3.0 1
308 2005 15.0 6.0 1
309 2005 10.0 5.0 1
310 2005 11.0 9.0 1
311 2005 10.0 4.0 1
312 2005 5.0 10.0 1
313 2005 2.0 4.0 1
314 2005 7.0 7.0 1
315 2005 9.0 10.0 1
316 2005 7.0 9.0 1
317 2005 7.0 10.0 1
318 2005 7.0 11.0 1
319 2005 8.0 7.0 1
320 2005 9.0 7.0 1
321 2005 9.0 9.0 1
362 2006 5.0 9.0 1
363 2006 1.0 4.0 1
364 2006 4.0 6.0 1
365 2006 4.0 7.0 1
366 2006 4.0 9.0 1
367 2006 6.0 8.0 1
368 2006 6.0 9.0 1
369 2006 6.0 10.0 1
370 2006 7.0 7.0 1
371 2006 7.0 8.0 1
372 2006 7.0 9.0 1
373 2006 8.0 2.0 1
374 2006 8.0 7.0 1
375 2006 8.0 6.0 1
376 2006 8.0 9.0 1
377 2006 8.0 10.0 1
378 2006 9.0 5.0 1
379 2006 9.0 14.0 1
380 2006 9.0 17.0 1
421 2007 12.0 6.0 1
422 2007 12.0 13.0 1
423 2007 12.0 2.0 1
424 2007 7.0 8.0 1
425 2007 1.0 2.0 1
426 2007 1.0 3.0 1
427 2007 6.0 10.0 1
428 2007 7.0 10.0 1
429 2007 11.0 5.0 1
430 2007 8.0 5.0 1
431 2007 8.0 6.0 1
432 2007 8.0 11.0 1
433 2007 9.0 3.0 1
434 2007 9.0 9.0 1
435 2007 11.0 3.0 1
479 2008 8.0 13.0 1
480 2008 2.0 4.0 1
481 2008 5.0 1.0 1
482 2008 5.0 7.0 1
483 2008 6.0 1.0 1
484 2008 6.0 10.0 1
485 2008 8.0 8.0 1
486 2008 9.0 6.0 1
487 2008 9.0 9.0 1
488 2008 10.0 3.0 1
489 2008 10.0 4.0 1
490 2008 11.0 3.0 1
491 2008 11.0 6.0 1
492 2008 11.0 8.0 1
493 2008 13.0 7.0 1
494 2008 19.0 19.0 1
530 2009 3.0 7.0 1
531 2009 6.0 9.0 1
532 2009 14.0 15.0 1
533 2009 7.0 5.0 1
534 2009 7.0 9.0 1
535 2009 8.0 3.0 1
536 2009 8.0 10.0 1
537 2009 10.0 6.0 1
538 2009 12.0 6.0 1
573 2010 11.0 5.0 1
574 2010 12.0 6.0 1
575 2010 20.0 18.0 1
576 2010 8.0 5.0 1
577 2010 8.0 11.0 1
578 2010 8.0 2.0 1
579 2010 1.0 2.0 1
580 2010 7.0 6.0 1
581 2010 3.0 5.0 1
582 2010 5.0 1.0 1
583 2010 5.0 7.0 1
584 2010 5.0 8.0 1
585 2010 6.0 8.0 1
586 2010 7.0 2.0 1
620 2011 7.0 8.0 1
621 2011 5.0 8.0 1
622 2011 1.0 3.0 1
623 2011 4.0 6.0 1
624 2011 4.0 7.0 1
625 2011 6.0 10.0 1
626 2011 7.0 2.0 1
627 2011 7.0 6.0 1
628 2011 7.0 10.0 1
629 2011 7.0 12.0 1
630 2011 8.0 3.0 1
631 2011 8.0 5.0 1
632 2011 9.0 3.0 1
633 2011 9.0 6.0 1
634 2011 10.0 3.0 1
635 2011 11.0 7.0 1
636 2011 12.0 5.0 1
672 2012 22.0 15.0 1
673 2012 9.0 6.0 1
674 2012 9.0 14.0 1
675 2012 10.0 7.0 1
676 2012 12.0 3.0 1
677 2012 12.0 9.0 1
678 2012 13.0 3.0 1
679 2012 9.0 4.0 1
680 2012 9.0 3.0 1
681 2012 8.0 10.0 1
682 2012 8.0 7.0 1
683 2012 8.0 2.0 1
684 2012 8.0 1.0 1
685 2012 7.0 5.0 1
686 2012 6.0 12.0 1
687 2012 6.0 10.0 1
688 2012 6.0 9.0 1
689 2012 6.0 7.0 1
690 2012 5.0 9.0 1
691 2012 4.0 9.0 1
692 2012 4.0 8.0 1
693 2012 2.0 5.0 1
694 2012 2.0 4.0 1
730 2013 10.0 5.0 1
731 2013 8.0 2.0 1
732 2013 8.0 6.0 1
733 2013 8.0 8.0 1
734 2013 9.0 9.0 1
735 2013 10.0 3.0 1
736 2013 10.0 4.0 1
737 2013 10.0 7.0 1
738 2013 13.0 8.0 1
739 2013 7.0 10.0 1
740 2013 7.0 4.0 1
741 2013 6.0 12.0 1
742 2013 6.0 8.0 1
743 2013 2.0 4.0 1
744 2013 2.0 5.0 1
745 2013 3.0 5.0 1
746 2013 3.0 6.0 1
747 2013 4.0 10.0 1
748 2013 5.0 8.0 1
749 2013 5.0 9.0 1
788 2014 12.0 3.0 1
789 2014 12.0 8.0 1
790 2014 12.0 16.0 1
791 2014 8.0 10.0 1
792 2014 11.0 4.0 1
793 2014 8.0 6.0 1
794 2014 8.0 5.0 1
795 2014 2.0 4.0 1
796 2014 6.0 11.0 1
797 2014 6.0 12.0 1
798 2014 7.0 9.0 1
799 2014 7.0 14.0 1
838 2015 5.0 10.0 1
839 2015 5.0 1.0 1
840 2015 5.0 8.0 1
841 2015 5.0 9.0 1
842 2015 6.0 8.0 1
843 2015 7.0 8.0 1
844 2015 7.0 9.0 1
845 2015 7.0 12.0 1
846 2015 9.0 4.0 1
847 2015 10.0 4.0 1
848 2015 10.0 7.0 1
849 2015 22.0 19.0 1
850 2015 14.0 6.0 1
886 2016 2.0 4.0 1
887 2016 5.0 8.0 1
888 2016 10.0 5.0 1
889 2016 7.0 5.0 1
890 2016 8.0 2.0 1
891 2016 9.0 4.0 1
892 2016 9.0 5.0 1
893 2016 9.0 6.0 1
894 2016 10.0 2.0 1
895 2016 10.0 7.0 1
896 2016 10.0 10.0 1
897 2016 11.0 6.0 1
898 2016 12.0 12.0 1
940 2017 14.0 17.0 1
941 2017 14.0 4.0 1
942 2017 14.0 8.0 1
943 2017 13.0 12.0 1
944 2017 13.0 3.0 1
945 2017 9.0 4.0 1
946 2017 7.0 9.0 1
947 2017 7.0 10.0 1
948 2017 7.0 12.0 1
949 2017 8.0 2.0 1
950 2017 8.0 6.0 1
951 2017 9.0 2.0 1
952 2017 9.0 8.0 1
953 2017 6.0 1.0 1
954 2017 9.0 9.0 1
955 2017 10.0 10.0 1
956 2017 11.0 6.0 1
957 2017 12.0 3.0 1
958 2017 12.0 5.0 1
959 2017 12.0 6.0 1
960 2017 6.0 9.0 1
961 2017 5.0 9.0 1
962 2017 4.0 10.0 1
963 2017 2.0 4.0 1
964 2017 3.0 6.0 1
965 2017 4.0 7.0 1
1011 2018 5.0 1.0 1
1012 2018 14.0 17.0 1
1013 2018 9.0 8.0 1
1014 2018 10.0 2.0 1
1015 2018 10.0 5.0 1
1016 2018 11.0 4.0 1
1017 2018 11.0 8.0 1
1018 2018 12.0 4.0 1
1019 2018 14.0 7.0 1
1020 2018 8.0 8.0 1
1021 2018 5.0 9.0 1
1022 2018 6.0 10.0 1
1023 2018 7.0 6.0 1
1024 2018 8.0 2.0 1
1063 2019 14.0 8.0 1
1064 2019 13.0 8.0 1
1065 2019 12.0 16.0 1
1066 2019 12.0 5.0 1
1067 2019 12.0 4.0 1
1068 2019 11.0 4.0 1
1069 2019 11.0 3.0 1
1070 2019 10.0 4.0 1
1071 2019 9.0 7.0 1
1072 2019 9.0 5.0 1
1073 2019 8.0 11.0 1
1074 2019 7.0 9.0 1
1075 2019 7.0 6.0 1
1076 2019 7.0 5.0 1
1077 2019 7.0 2.0 1
1078 2019 6.0 11.0 1
1079 2019 6.0 1.0 1
1080 2019 2.0 4.0 1
1125 2020 4.0 7.0 1
1126 2020 7.0 7.0 1
1127 2020 7.0 8.0 1
1128 2020 19.0 16.0 1
1129 2020 16.0 6.0 1
1130 2020 12.0 9.0 1
1131 2020 12.0 4.0 1
1132 2020 12.0 2.0 1
1133 2020 11.0 4.0 1
1134 2020 10.0 5.0 1
1135 2020 10.0 3.0 1
1136 2020 9.0 6.0 1
1137 2020 8.0 11.0 1
1138 2020 8.0 2.0 1
1139 2020 7.0 10.0 1
1182 2021 10.0 8.0 1
1183 2021 9.0 5.0 1
1184 2021 10.0 6.0 1
1185 2021 10.0 3.0 1
1186 2021 10.0 2.0 1
1187 2021 11.0 3.0 1
1188 2021 13.0 5.0 1
1189 2021 14.0 5.0 1
1190 2021 9.0 2.0 1
1191 2021 8.0 8.0 1
1192 2021 8.0 6.0 1
1193 2021 8.0 4.0 1
1194 2021 8.0 2.0 1
1195 2021 7.0 9.0 1
1196 2021 7.0 2.0 1
1197 2021 7.0 1.0 1
1198 2021 3.0 6.0 1
1199 2021 3.0 5.0 1
1228 2022 6.0 5.0 1
1229 2022 6.0 9.0 1
1230 2022 18.0 9.0 1
1231 2022 7.0 8.0 1
1232 2022 9.0 10.0 1
1233 2022 8.0 4.0 1
1234 2022 7.0 3.0 1
1235 2022 7.0 2.0 1
1236 2022 8.0 6.0 1
1237 2022 2.0 4.0 1
1238 2022 8.0 10.0 1
1239 2022 5.0 1.0 1
1240 2022 9.0 13.0 1
1241 2022 10.0 6.0 1
1242 2022 1.0 2.0 1
1243 2022 6.0 10.0 1
1244 2022 10.0 7.0 1
1245 2022 11.0 5.0 1
1246 2022 17.0 12.0 1
1247 2022 9.0 6.0 1
1248 2022 47.0 39.0 1
In [32]:
has_duplicates = final_real_estate.duplicated().any()
In [28]:
# Splitting the dataset by popularity 
#1 Count how many times each (bed, bath) combination appears 

combo_counts = final_real_estate.groupby(['bed', 'bath']).size().reset_index(name = 'Sales Count (Bed and Bath)')

#2 Merge the counts to the original dataset
final_real_estate_counts = final_real_estate.merge(combo_counts, on = ['bed', 'bath'], how = 'left')

#3 use quantiles to define popularity levels 
q_low = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.33)
q_high = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.66)

# Create a new column for popularity
def popularity_label(count):
    if count <= q_low:
        return 'Least Popular'
    elif count <= q_high:
        return 'Average Popularity'
    else:
        return 'Most Popular'
# apply the function to the dataset
final_real_estate_counts['popularity'] = final_real_estate_counts['Sales Count (Bed and Bath)'].apply(popularity_label)        

What makes a property less popular? (Based on bed to bathroom ratio)¶

From the below table, it is showned that majority of the property with room ratio of less than 1 or more than 1.5 are categorised as less popular. Generally speaking less popular property do not have a very good balance of bedroom and bathroom, or property with too much bedrooms and bathroom tend to less attractive as it exceed the average people per household in the US and it may be difficult for investors to resell it due to its size and demand for the property.

In [29]:
least_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Least Popular']

display(least_popular_property.groupby(['bed', 'bath', 'room_ratio']).size().reset_index())
bed bath room_ratio 0
0 1.0 1.0 1.000000 347
1 1.0 2.0 0.500000 78
2 1.0 3.0 0.333333 4
3 1.0 4.0 0.250000 2
4 2.0 1.0 2.000000 1366
5 2.0 3.0 0.666667 346
6 2.0 4.0 0.500000 25
7 2.0 5.0 0.400000 2
8 3.0 4.0 0.750000 456
9 3.0 5.0 0.600000 56
10 3.0 6.0 0.500000 10
11 3.0 7.0 0.428571 1
12 3.0 8.0 0.375000 1
13 4.0 1.0 4.000000 307
14 4.0 4.0 1.000000 1187
15 4.0 5.0 0.800000 329
16 4.0 6.0 0.666667 79
17 4.0 7.0 0.571429 25
18 4.0 8.0 0.500000 3
19 4.0 9.0 0.444444 2
20 4.0 10.0 0.400000 2
21 5.0 1.0 5.000000 24
22 5.0 2.0 2.500000 432
23 5.0 3.0 1.666667 695
24 5.0 4.0 1.250000 576
25 5.0 5.0 1.000000 379
26 5.0 6.0 0.833333 202
27 5.0 7.0 0.714286 90
28 5.0 8.0 0.625000 25
29 5.0 9.0 0.555556 15
30 5.0 10.0 0.500000 3
31 6.0 1.0 6.000000 3
32 6.0 2.0 3.000000 227
33 6.0 3.0 2.000000 285
34 6.0 4.0 1.500000 198
35 6.0 5.0 1.200000 141
36 6.0 6.0 1.000000 104
37 6.0 7.0 0.857143 91
38 6.0 8.0 0.750000 65
39 6.0 9.0 0.666667 37
40 6.0 10.0 0.600000 19
41 6.0 11.0 0.545455 8
42 6.0 12.0 0.500000 5
43 7.0 1.0 7.000000 1
44 7.0 2.0 3.500000 27
45 7.0 3.0 2.333333 84
46 7.0 4.0 1.750000 57
47 7.0 5.0 1.400000 47
48 7.0 6.0 1.166667 23
49 7.0 7.0 1.000000 26
50 7.0 8.0 0.875000 24
51 7.0 9.0 0.777778 20
52 7.0 10.0 0.700000 10
53 7.0 11.0 0.636364 5
54 7.0 12.0 0.583333 4
55 7.0 13.0 0.538462 1
56 7.0 14.0 0.500000 1
57 8.0 1.0 8.000000 2
58 8.0 2.0 4.000000 15
59 8.0 3.0 2.666667 66
60 8.0 4.0 2.000000 56
61 8.0 5.0 1.600000 26
62 8.0 6.0 1.333333 19
63 8.0 7.0 1.142857 5
64 8.0 8.0 1.000000 6
65 8.0 9.0 0.888889 4
66 8.0 10.0 0.800000 5
67 8.0 11.0 0.727273 4
68 8.0 13.0 0.615385 3
69 9.0 2.0 4.500000 2
70 9.0 3.0 3.000000 65
71 9.0 4.0 2.250000 24
72 9.0 5.0 1.800000 13
73 9.0 6.0 1.500000 15
74 9.0 7.0 1.285714 3
75 9.0 8.0 1.125000 3
76 9.0 9.0 1.000000 7
77 9.0 10.0 0.900000 6
78 9.0 13.0 0.692308 1
79 9.0 14.0 0.642857 2
80 9.0 17.0 0.529412 1
81 10.0 2.0 5.000000 3
82 10.0 3.0 3.333333 8
83 10.0 4.0 2.500000 16
84 10.0 5.0 2.000000 8
85 10.0 6.0 1.666667 3
86 10.0 7.0 1.428571 5
87 10.0 8.0 1.250000 3
88 10.0 10.0 1.000000 2
89 10.0 12.0 0.833333 1
90 10.0 14.0 0.714286 1
91 11.0 3.0 3.666667 4
92 11.0 4.0 2.750000 4
93 11.0 5.0 2.200000 3
94 11.0 6.0 1.833333 4
95 11.0 7.0 1.571429 1
96 11.0 8.0 1.375000 2
97 11.0 9.0 1.222222 1
98 11.0 11.0 1.000000 1
99 11.0 12.0 0.916667 1
100 12.0 2.0 6.000000 2
101 12.0 3.0 4.000000 6
102 12.0 4.0 3.000000 3
103 12.0 5.0 2.400000 3
104 12.0 6.0 2.000000 9
105 12.0 7.0 1.714286 2
106 12.0 8.0 1.500000 1
107 12.0 9.0 1.333333 3
108 12.0 12.0 1.000000 1
109 12.0 13.0 0.923077 1
110 12.0 16.0 0.750000 2
111 13.0 3.0 4.333333 2
112 13.0 5.0 2.600000 1
113 13.0 7.0 1.857143 1
114 13.0 8.0 1.625000 2
115 13.0 12.0 1.083333 1
116 13.0 14.0 0.928571 1
117 14.0 4.0 3.500000 1
118 14.0 5.0 2.800000 1
119 14.0 6.0 2.333333 1
120 14.0 7.0 2.000000 1
121 14.0 8.0 1.750000 2
122 14.0 15.0 0.933333 1
123 14.0 17.0 0.823529 2
124 15.0 6.0 2.500000 1
125 15.0 9.0 1.666667 1
126 16.0 6.0 2.666667 1
127 16.0 9.0 1.777778 1
128 17.0 12.0 1.416667 1
129 18.0 9.0 2.000000 1
130 19.0 16.0 1.187500 1
131 19.0 19.0 1.000000 1
132 20.0 18.0 1.111111 1
133 22.0 15.0 1.466667 1
134 22.0 19.0 1.157895 1
135 47.0 39.0 1.205128 1

What makes a property popular? (Based on bed to bathroom ratio)¶

From the analysis, a 3 bedroom and 2-3 bathroom property are the most popular because the demand for a property is heavily driven by the usage of the property, on average a US household consist of 3.15 people, thus property with this size is the most popular.

In [30]:
most_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Most Popular']

# identify the unique room_ratio 
most_popular_property.groupby(['bed', 'bath','room_ratio']).size().reset_index()
Out[30]:
bed bath room_ratio 0
0 3.0 2.0 1.5 5255
1 3.0 3.0 1.0 2846

CONCLUSION¶

The top 5 States for real estate sales in the US are New Jersey, Connecticut, New York, Pennsylvania and Massachusetts for the past 22 years. If this trend continues, the real estates in these states may be potential investment options as the market is quite active, thus more liquid. However, real estates in these states may be potentially more expensive, but at the same time these states may have more employment opportunities. If quality of life such as pollution rates are of concern, other less crowded states maybe better choices for living.

Moreover, real estate values are mainly driven by value generating drivers such as location and the property's ability to generate rent rather than just real estate size. It is important to note that the balance between bathroom and bedroom is very important indicating the liquidity of a property, demands for property is mainly driven by needs of the buyer. If the buying motive is for investment purposes (potential future sale or rent generation), it is important to look into what the target market is demanding.

Furthermore, a 3 bedroom and 2 bathroom real estate is the most demanded real estate for the past 22 years and if this trend continues to the future, this is mainly driven by the average household size in the US. This type of real estate would be a safe and great investment option.